This lecture is a summary
of this whole course.
First, let's revisit the topics
that we covered in this course.
In the beginning, we talked about
the natural language processing and
how it can enrich text representation.
We then talked about how to mine
knowledge about the language,
natural language used to express the,
what's observing the world in text and
data.
In particular, we talked about
how to mine word associations.
We then talked about how
to analyze topics in text.
How to discover topics and analyze them.
This can be regarded as
knowledge about observed world,
and then we talked about how to mine
knowledge about the observer and
particularly talk about the, how to
mine opinions and do sentiment analysis.
And finally, we will talk about
the text-based prediction, which has to
do with predicting values of other real
world variables based on text data.
And in discussing this, we will also
discuss the role of non-text data,
which can contribute additional
predictors for the prediction problem,
and also can provide context for
analyzing text data, and
in particular we talked about how
to use context to analyze topics.
So here are the key high-level
take away messages from this cost.
I going to go over these major topics and
point out what are the key take-away
messages that you should remember.
First the NLP and text representation.
You should realize that NLP
is always very important for
any text replication because it
enriches text representation.
The more NLP the better text
representation we can have.
And this further enables more
accurate knowledge discovery,
to discover deeper knowledge,
buried in text.
However, the current estate of art
of natural energy processing is,
still not robust enough.
So, as an result,
the robust text mining technologies today,
tend to be based on world [INAUDIBLE].
And tend to rely a lot
on statistical analysis,
as we've discussed in this course.
And you may recall we've mostly
used word based representations.
And we've relied a lot on
statistical techniques,
statistical learning
techniques particularly.
In word-association mining and
analysis the important points first,
we are introduced the two concepts for
two basic and
complementary relations of words,
paradigmatic and syntagmatic relations.
These are actually very general
relations between elements sequences.
If you take it as meaning
elements that occur in similar
context in the sequence and elements
that tend to co-occur with each other.
And these relations might be also
meaningful for other sequences of data.
We also talked a lot about
test the similarity then we
discuss how to discover
paradynamic similarities compare
the context of words discover
words that share similar context.
At that point level,
we talked about representing text
data with a vector space model.
And we talked about some retrieval
techniques such as BM25 for
measuring similarity of text and
for assigning weights to terms,
tf-idf weighting, et cetera.
And this part is well-connected
to text retrieval.
There are other techniques that
can be relevant here also.
The next point is about
co-occurrence analysis of text, and
we introduce some information
theory concepts such as entropy,
conditional entropy,
and mutual information.
These are not only very useful for
measuring the co-occurrences of words,
they are also very useful for
analyzing other kind of data, and
they are useful for, for example, for
feature selection in text
categorization as well.
So this is another important concept,
good to know.
And then we talked about
the topic mining and analysis, and
that's where we introduce in
the probabilistic topic model.
We spent a lot of time to
explain the basic topic model,
PLSA in detail and this is, those are the
basics for understanding LDA which is.
Theoretically, a more opinion model, but
we did not have enough time to really
go in depth in introducing LDA.
But in practice,
PLSA seems as effective as LDA and
it's simpler to implement and
it's also more efficient.
In this part of Wilson videos is some
general concepts that would be useful to
know, one is generative model,
and this is a general method for
modeling text data and
modeling other kinds of data as well.
And we talked about the maximum life
erase data, the EM algorithm for
solving the problem of
computing maximum estimator.
So, these are all general techniques
that tend to be very useful
in other scenarios as well.
Then we talked about the text
clustering and the text categorization.
Those are two important building blocks
in any text mining application systems.
In text with clustering we talked
about how we can solve the problem by
using a slightly different mixture module
than the probabilistic topic model.
and we then also prefer to
view the similarity based
approaches to test for cuss word.
In categorization we also talk
about the two kinds of approaches.
One is generative classifies
that rely on to base word to
infer the condition of or
probability of a category given text data,
in deeper we'll introduce you should
use [INAUDIBLE] base in detail.
This is the practical use for technique,
for a lot of text, capitalization tasks.
We also introduce the some
discriminative classifiers,
particularly logistical regression,
can nearest labor and SBN.
They also very important, they are very
popular, they are very useful for
text capitalization as well.
In both parts, we'll also discuss
how to evaluate the results.
Evaluation is quite important because if
the matches that you use don't really
reflect the volatility of the method then
it would give you misleading results so
its very important to
get the variation right.
And we talked about variation of
categorization in detail was a lot of
specific measures.
Then we talked about the sentiment
analysis and the paradigm and
that's where we introduced
sentiment classification problem.
And although it's a special
case of text recalculation, but
we talked about how to extend or
improve the text recalculation method
by using more sophisticated features that
would be needed for sentiment analysis.
We did a review of some common use for
complex features for text analysis, and
then we also talked about how to
capture the order of these categories,
in sentiment classification, and
in particular we introduced ordinal
logistical regression then we also talked
about Latent Aspect Rating Analysis.
This is an unsupervised way of using
a generative model to understand and
review data in more detail.
In particular, it allows us to
understand the composed ratings of
a reviewer on different
aspects of a topic.
So given text reviews
with overall ratings,
the method allows even further
ratings on different aspects.
And it also allows us to infer,
the viewers laying their
weights on these aspects or
which aspects are more important to
a viewer can be revealed as well.
And this enables a lot of
interesting applications.
Finally, in the discussion of prediction,
we mainly talk about the joint mining
of text and non text data, as they
are both very important for prediction.
We particularly talked about how text data
can help non-text data and vice versa.
In the case of using non-text
data to help text data analysis,
we talked about
the contextual text mining.
We introduced the contextual PLSA as a
generalizing or generalized model of PLSA
to allows us to incorporate the context
of variables, such as time and location.
And this is a general way to allow us
to reveal a lot of interesting topic
of patterns in text data.
We also introduced the net PLSA,
in this case we used social network or
network in general of text
data to help analyze puppets.
And finally we talk about how
can be used as context to
mine potentially causal
Topics in text layer.
Now, in the other way of using text to
help interpret patterns
discovered from LAM text data,
we did not really discuss anything in
detail but just provide a reference but
I should stress that that's after a very
important direction to know about,
if you want to build a practical
text mining systems,
because understanding and
interpreting patterns is quite important.
So this is a summary of the key
take away messages, and
I hope these will be very
useful to you for building any
text mining applications or to you for
the starting of these algorithms.
And this should provide a good basis for
you to read from your research papers,
to know more about more of allowance for
other organisms or
to invent new hours in yourself.
So to know more about this topic,
I would suggest you to look
into other areas in more depth.
And during this short period
of time of this course,
we could only touch the basic concepts,
basic principles, of text mining and
we emphasize the coverage
of practical algorithms.
And this is after the cost
of covering algorithms and
in many cases we omit the discussion
of a lot of algorithms.
So to learn more about the subject
you should definitely learn more
about the natural language process
because this is foundation for
all text based applications.
The more NLP you can do, the better
the additional text that you can get, and
then the deeper knowledge
you can discover.
So this is very important.
The second area you should look into
is the Statistical Machine Learning.
And these techniques are now
the backbone techniques for
not just text analysis applications but
also for NLP.
A lot of NLP techniques are nowadays
actually based on supervised machinery.
So, they are very important
because they are a key
to also understanding some
advancing NLP techniques and
naturally they will provide more tools for
doing text analysis in general.
Now, a particularly interesting area,
called deep learning has attracted
a lot of attention recently.
It has also shown promise
in many application areas,
especially in speech and vision, and
has been applied to text data as well.
So, for example, recently there has
work on using deep learning to do
segment analysis to
achieve better accuracy.
So that's one example of [INAUDIBLE]
techniques that we weren't able to cover,
but that's also very important.
And the other area that has emerged
in status learning is the water and
baring technique, where they can
learn better recognition of words.
And then these better recognitions will
allow you confuse similarity of words.
As you can see,
this provides directly a way to discover
the paradigmatic relations of words.
And results that people have got,
so far, are very impressive.
That's another promising technique
that we did not have time to touch,
but, of course,
whether these new techniques
would lead to practical useful techniques
that work much better than the current
technologies is still an open
question that has to be examined.
And no serious evaluation
has been done yet.
In, for example, examining
the practical value of word embedding,
other than word similarity and
basic evaluation.
But nevertheless,
these are advanced techniques
that surely will make impact
in text mining in the future.
So its very important to
know more about these.
Statistical learning is also the key to
predictive modeling which is very crucial
for many big data applications and we did
not talk about that predictive modeling
component but this is mostly about
the regression or categorization
techniques and this is another reason
why statistical learning is important.
We also suggest that you learn more about
data mining, and that's simply because
general data mining algorithms can always
be applied to text data, which can be
regarded as as special
case of general data.
So there are many applications
of data mining techniques.
In particular for example, pattern
discovery would be very useful to generate
the interesting features for test analysis
and the reason that an information network
that mining techniques can also be used
to analyze text information at work.
So these are all good to know.
In order to develop effective
text analysis techniques.
And finally, we also recommend you to
learn more about the text retrieval,
information retrieval, of search engines.
This is especially important if you
are interested in building practical text
application systems.
And a search ending would
be an essential system
component in any text-based applications.
And that's because texts data
are created for humans to consume.
So humans are at the best position
to understand text data and
it's important to have human in the loop
in big text data applications, so
it can in particular help text
mining systems in two ways.
One is through effectively reduce
the data size from a large collection to
a small collection with the most
relevant text data that only matter for
the particular interpretation.
So the other is to provide a way to
annotate it, to explain parents,
and this has to do with
knowledge providence.
Once we discover some knowledge,
we have to figure out whether or
not the discovery is really reliable.
So we need to go back to
the original text to verify that.
And that is why the search
engine is very important.
Moreover, some techniques
of information retrieval,
for example BM25, vector space and
are also very useful for text data mining.
We only mention some of them,
but if you know more about
text retrieval you'll see that there
are many techniques that are used for it.
Another technique that it's used for
is indexing technique that enables quick
response of search engine to a user's
query, and such techniques can be
very useful for building efficient
text mining systems as well.
So, finally, I want to remind
you of this big picture for
harnessing big text data that I showed
you at your beginning of the semester.
So in general, to deal with
a big text application system,
we need two kinds text,
text retrieval and text mining.
And text retrieval, as I explained,
is to help convert big text data into
a small amount of most relevant data for
a particular problem, and can also help
providing knowledge provenance,
help interpreting patterns later.
Text mining has to do with further
analyzing the relevant data to discover
the actionable knowledge that can be
directly useful for decision making or
many other tasks.
So this course covers text mining.
And there's a companion course
called Text Retrieval and
Search Engines that covers text retrieval.
If you haven't taken that course,
it would be useful for you to take it,
especially if you are interested
in building a text caching system.
And taking both courses will give you
a complete set of practical skills for
building such a system.
So in [INAUDIBLE]
I just would like to thank you for
taking this course.
I hope you have learned useful knowledge
and skills in test mining and [INAUDIBLE].
As you see from our discussions
there are a lot of opportunities for
this kind of techniques and
there are also a lot of open channels.
So I hope you can use what you have
learned to build a lot of use for
applications will benefit society and
to also join
the research community to discover new
techniques for text mining and benefits.
Thank you.
[MUSIC]

